Abstract


Effective,  cost-efficient  testing  is  critical  to  the  long-term  success  of  open  architecture 
within  the  Navy’s  Integrated  Warfare  System.  In  previous  research  we  have 
developed  a  simple,  effective  framework  to  examine  the  testing  of  complex  systems. 
This  model  and  its  prototype  decision  aid  provide  a  rigorous  yet  tractable  approach 
to  improve  system  testing,  and  to  better  understand  and  document  the  system  and 
component  interdependencies  across  the  enterprise.  An  integral  part  of  this  model  is 
characterizing  test  coverages  on  modules.  Using  idealized  simulations  of  complex 
systems,  we  investigate  the  sensitivity  of  test  selection  strategy  to  the  precision  with 
which  these  coverages  are  specified.  Monte  Carlo  analysis  indicates  that  best-test 
selection  strategies  are  somewhat  sensitive  to  the  precision  of  test  coverage 
specification,  suggesting  significant  impact  on  testing  under  fixed-cost  constraint. 
These  results  provide  significant  insight  as  we  extend  this  work  with  further  study  of 
real-world  systems  by  applying,  and  refining,  the  mathematical  analysis  and 
computer  simulation  within  this  framework.  The  current  decision-aid  software  will  be 
further  developed  using  these  operational  test  and  evaluation  data,  improving  the 
fidelity  of  the  current  modeling  while  making  available  to  program  managers  and 
system  designers  a  usable  and  relevant  tool  for  test-retest  decisions. 

Overview


In  previous  research  we  have  developed  a  framework  for  describing  the  performance 
of  a  test  suite  for  assessment  of  a  complex  system  under  repair  or  under  routine 
maintenance  (Pfeiffer,  Kanevsky,  &  Housel,  2009a,  2009b,  2010).  This  model  was  then 
implemented  in  a  decision  support  tool  to  investigate  strategies  for  test  selection  under  fixed 
cost  or  fixed  reliability  constraints.  Construction  of  the  model  for  simulation  required  the 
characterization  of  test  coverages  on  modules;  that  is,  we  needed  an  a  priori  estimate  of 
how  much  of  the  module  was  exercised  by  a  particular  test. 

For  hardware  modules,  this  test  coverage  is  reasonably  simple  to  estimate  (see,  for 
example,  Barford,  Kanevsky,  &  Kamas,  2004).  For  software  systems,  however,  the  notion 
of  test  coverage  is  more  problematic  and  may  require  more  knowledge  of  the  internal 
structure  of  the  modules  and  their  interdependencies  (Zhu,  Hall,  &  May,  1997).  Although 
the  notion  of  software  testing  is  well  studied,  the  characterization  of  test  coverage  can  vary 
widely  among  investigators  (compare,  for  example,  Leung  &  White,  1991;  White  &  Leung, 
1992;  Weyuker,  1998;  Tsai,  2001;  Rothermel,  Untch,  &  Harrold,  2001;  and  Mao  &  Lu,  2005). 
Often,  internal  knowledge  of  hardware  and  software  modules  will  not  be  available  to 
developers  of  integrated  test  suites,  particularly  with  commercial-off-the-shelf  (COTS) 
technologies.  The  increasing  use  of  COTS  in  current  weapons  systems  (Caruso,  1995; 
Dalcher,  2000),  coupled  with  the  complexity  of  end-to-end  systems  (Athans,  1987;  Brazet, 
1993),  suggests  that  characterizing  test  coverages  in  an  open  architecture  system  will 
remain  a  significant  challenge. 

How  important  are  these  test  coverages  to  developing  an  effective  test  strategy? 

That  is,  how  precisely  and  how  accurately  must  we  specify  these  coverages  to  evaluate 
effective  test  strategies?  Extending  our  previous  work,  we  investigate  the  sensitivity  of  test 
selection  strategy  to  the  characterization  of  test  coverages  within  the  system  under  test. 
Using  an  analytic  approach  to  inform  further  modeling  and  simulation  work,  we  seek  to 
better  understand  how  well  we  must  specify  these  a  priori  coverage  estimates  in  order  to 
derive  useful  testing  strategies. 

The  rest  of  this  paper  is  organized  as  follows:  Background  will  briefly  review  the 
framework  we  have  developed  for  investigating  testing  strategies;  Analytic  Modeling  and 
Computer  Simulation  Approach  will  discuss  the  analytic  background  and  simulation 
approach  in  examining  the  sensitivity  of  information  returned  to  the  coverage  specification; 
Simulation  Results  will  describe  simulation  results  and  significant  findings;  and  Conclusions 
and  Future  Work  will  discuss  future  avenues  for  research. 

Background 

In  the  present  discussion,  we  define  testing  as  the  mechanism  by  which  we  trade 
some  fixed  cost  (e.g.,  time,  money)  for  information  about  the  state  of  subcomponents  and 
overall reliability of our system (Figure 1). In general, we seek the maximum information
available  for  the  minimum  cost. 

[Figure 1 image: information returned versus cost of testing in budget and schedule (Low to High). Annotation: good testing strategies offer the most information per unit cost.]

Figure  1.  Information  Returned  for  Cost  of  Testing  Executed 

Note.  An  idealized  representation  of  testing  strategies  in  terms  of  information  returned  for 
testing  accomplished.  Each  solid  line  represents  a  particular  testing  strategy,  with  better 
strategies  distinguished  by  steeper  ascent  or  greater  information  return  per  unit  cost. 

Mathematical  models  proposed  by  von  Neumann  (1952)  and  Moore  and  Shannon 
(1956a,  1956b)  shaped  much  of  the  early  work  on  component  and  system  reliability.  An 
early  focus  on  fault  diagnosis,  particularly  in  electro-mechanical  systems,  characterized  work 
by  Sobel  and  Groll  (1966),  Butterworth  (1972),  Garey  (1972),  Fishman  (1990),  and  others,  in 
what  is  often  known  as  the  test-sequencing  problem.  That  is,  which  test  sequence  most 
cost-efficiently  arrives  at  a  correct  diagnosis  in  a  failed  system? 

In  software  engineering,  we  are  most  often  faced  not  with  a  failed  system,  but  with  a 
large  system  undergoing  maintenance.  Testing  in  this  situation  is  used  to  establish  that  no 
defect  has  been  added  to  the  system  by  these  engineering  upgrades.  This  regression 
testing,  or  test-retest  dilemma,  can  be  more  difficult  than  diagnostic  testing  of  a  failed 
system,  because  by  its  nature,  testing  cannot  absolutely  demonstrate  that  no  defect  exists 
(Dijkstra,  1972).  A  good  test  suite  and  good  testing  strategy,  however,  can  often 
demonstrate  that  a  defect  is  highly  unlikely  in  the  system  under  test  (Zhu  et  al.,  1997). 

In  previous  work  (Pfeiffer  et  al.,  2009a,  2009b,  2010)  we  have  developed  a  unified 
modeling  framework  with  risk  and  cost  as  the  common  tension  regulating  the  degree  of 
testing  required.  The  cost  of  testing  can  be  evaluated  in  terms  of  dollars,  or  time,  or  both, 
with  an  assumption  that  more  testing  is  generally  more  costly;  it  is  not  true  in  general, 
however,  that  more  testing  always  increases  our  knowledge  of  the  state  of  our  system.  This 
knowledge  is  tied  to  risk.  In  this  context,  risk  refers  to  the  degree  of  certainty  we  can 
achieve  (or  ambiguity  we  can  eliminate)  within  a  fixed  cost  constraint  or  within  the  power  or 
sensitivity  of  a  given  test  suite. 

We characterize our system under test S as a collection of modules {Mi}, and a suite
of  tests  {Tx}  used  to  interrogate  these  modules  (Figure  2).  These  tests  are  our  means  to 
identify  defective  modules,  or,  in  the  case  of  test-retest,  to  establish  with  high  probability  that 
no  defects  exist.  We  assume  that  tests  return  ambiguous  information  about  the  state  of 
modules  within  the  system;  that  is,  no  single  test  is  likely  to  return  perfect  knowledge  about 
a  particular  module. 


Figure  2.  Simple  Test  Coverage  on  Modules 

Note. Notional depiction of the coverage of Tx on S, with multiple modules exercised by this
test. A FAIL result from Tx indicates that at least one of the subset {Mi, Mj, Mk} has failed.

Each  test  is  assumed  to  exercise  several  modules  (Figure  2),  and  several  tests  may 
exercise  the  same  module.  In  the  case  of  several  tests  covering  a  particular  module,  the 
framework  easily  accounts  for  overlapping  and  disjoint  coverages  (Figure  3). 

The model framework described in Pfeiffer et al. (2009a, 2009b, 2010) presents a
useful  and  realistic  ambiguity  in  two  aspects.  The  first  is  that  a  test  is  rarely  assumed  to 
cover  or  exercise  all  functionality  of  a  module.  This  means  that  when  a  particular  test  Tx 
passes,  we  know  only  that  the  region  exercised  by  the  test  does  not  contain  a  defect;  a 
defect  may  still  exist  in  those  regions  not  inspected  by  the  test  (Figure  2).  A  second 
ambiguous  aspect  is  that  when  test  Tx  fails,  several  modules  may  be  at  fault  (Figure  2), 
though  such  a  result  should  significantly  reduce  the  number  of  suspect  modules  in  S. 




ACQUISITION  RESEARCH:  CREATING  SYNERGY  FOR  INFORMED  CHANGE  -  323 


Figure  3.  Overlapping  Coverage  Between  Tests 

Note. The overlapping coverage between tests Tx and Ty is characterized with the arcs Aix
and Aiy. The joint coverage is computable as the intersection of these arcs.

The  vector  arcs  specifying  test  coverage  are  intended  to  lend  precision  to  the  model 
specification  and  implementation.  With  these  vector  artifacts,  the  overlap  among  tests  on  a 
single  module  can  be  precisely  specified,  and  the  disjoint  regions  can  be  similarly  specified 
(Figure  3).  In  the  original  work  (Pfeiffer  et  al.,  2009a),  we  proposed  that  subject  matter 
experts  could  estimate  these  coverage  data  as  a  starting  point  for  further  modeling  and 
simulation  work.  In  the  present  study,  we  further  examine  how  precise  these  estimates 
should  be  to  deliver  meaningful  decision  support  for  cost-effective  test  strategies. 
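
The overlap bookkeeping described above can be sketched in a few lines. This is a minimal illustration only, assuming each coverage arc is stored as a (start, end) pair of fractions of a module's extent with no wraparound; the function names and this representation are hypothetical and are not taken from the decision support tool.

```python
def arc_length(arc):
    """Fraction of the module covered by one arc (start, end), 0 <= start <= end <= 1."""
    start, end = arc
    return end - start


def overlap(arc_x, arc_y):
    """Length of the region of the module exercised by both tests."""
    lo = max(arc_x[0], arc_y[0])
    hi = min(arc_x[1], arc_y[1])
    return max(0.0, hi - lo)


def combined(arc_x, arc_y):
    """Length of the region exercised by at least one of the two tests."""
    return arc_length(arc_x) + arc_length(arc_y) - overlap(arc_x, arc_y)


# Hypothetical arcs for tests Tx and Ty on the same module Mi.
a_ix = (0.0, 0.6)
a_iy = (0.4, 0.9)
print(overlap(a_ix, a_iy))   # region shared by both tests
print(combined(a_ix, a_iy))  # region covered by either test
```

Disjoint arcs simply return an overlap of zero, so the same functions cover both cases shown in Figure 3.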

Analytic  Modeling  and  Computer  Simulation  Approach 

We characterize our knowledge of the system under test S as a vector of
probabilities {bi} that any given module Mi is bad. Our knowledge of the system is perfect
when every bi is either 0 (absolutely good) or 1 (absolutely bad). In practice, we are unlikely
to see perfect results (i.e., bi = 0 or 1), though we can, with a well-designed test suite and
an effective test strategy, minimize the residual information entropy of the vector (Pfeiffer et
al., 2009a). This entropy is defined following Shannon (1948):

h_i = -b_i \log_2 b_i - (1 - b_i) \log_2 (1 - b_i)    (1)

The  initial  values  for  {bi}  are  assumed  to  be  available  from  subject  matter  experts  or 
a  priori  failure  rate  estimates.  Our  simulation  work  has  demonstrated  that  test  strategy 
outcomes  are  relatively  insensitive  to  these  initial  {bi}  because  of  the  iterative  nature  of  this 
approach.  That  is,  after  a  few  tests  have  been  executed,  the  initial  vector  {bi}  moves 
significantly  towards  lower  entropy  (Pfeiffer  et  al.,  2010).  The  test  coverages  connecting  the 
tests {Tx} to the modules {Mi} appear to be the more relevant initial criteria in these
simulations  of  system  testing.  This  is  another  motivation  for  the  present  study. 

[Figure 4 image: four panels titled "Diagnosed vs True State" for Test 000, Test 012, Test 026, and Test 060; horizontal axis: Module Number; legend: Diagnosed State, True State.]

Figure  4.  Diagnostic  Sequence  From  an  Idealized  Simulation 

Note.  From  simulation  results  in  the  decision  support  tool,  this  is  a  diagnostic  sequence  or 
trial  from  the  Monte  Carlo  simulation  of  testing.  In  this  trial,  a  single  defect  has  been  planted 
in  Module  11  (blue  ellipse),  and  testing  improves  the  knowledge  of  the  state  of  the  system  in 
the  probability  vector  {bi}  (red  squares).  Although  we  appear  to  have  a  good  diagnosis  by 
Test 026 (lower left), with Module 11 identified as bad, we can see that further testing more
clearly  eliminates  all  other  modules  as  suspect.  This  would  be  most  important  in  a  regression 
test-retest  scenario. 

The  decision  support  tool  developed  in  Pfeiffer  et  al.  (2009)  simulates  the  testing  of  a 
complex  system  using  minimal  descriptions  of  tests,  modules,  and  their  connecting 
coverages.  For  idealized  simulations,  a  range  of  coverages  is  specified  between  tests  and 
modules,  coupled  with  a  target  number  of  tests  per  module  and  modules  per  test. 
Simulations  may  be  run  with  zero  or  more  defects.  The  zero-defect  case  is  particularly 
important  for  investigating  test-retest  or  regression  cases. 

Each  simulation  typically  involves  a  large  number  of  trials  (notionally  100  to  1000) 
and  this  facilitates  examining  the  bounds  of  the  idealized  assumptions.  The  diagnostic 
sequence  from  a  single  trial  is  depicted  in  Figure  4.  The  reduction  in  residual  entropy 
across  the  system  is  apparent  as  testing  progresses  from  the  initial  state  (Figure  4,  upper 
left)  to  a  usable  diagnosis  (Figure  4,  lower  right)  with  the  defective  module  correctly 
identified. The system entropy is computed as the aggregate of residual entropy associated
with each module:

H = \sum_{M} \left[ -b_i \log_2 b_i - (1 - b_i) \log_2 (1 - b_i) \right]    (2)
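
As a concrete illustration of Equations 1 and 2, the following minimal sketch computes the residual entropy of a single module and of a whole system. The function names are hypothetical, and the convention that the entropy is zero at b = 0 or b = 1 is the usual limiting value rather than something stated in the text.

```python
import math


def module_entropy(b):
    """Residual entropy in bits for one module with probability b of being bad."""
    if b <= 0.0 or b >= 1.0:
        return 0.0  # limiting value: a certain state carries no residual entropy
    return -b * math.log2(b) - (1.0 - b) * math.log2(1.0 - b)


def system_entropy(b_vec):
    """Aggregate residual entropy over all modules."""
    return sum(module_entropy(b) for b in b_vec)


# A 20-module system initialized at maximum entropy (b = 0.5 for every module).
print(system_entropy([0.5] * 20))
# The same system after testing has pushed most modules towards "good".
print(system_entropy([0.05] * 19 + [0.9]))
```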


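After execution of a test, we update the prior probability bi of each module Mi to the
new probability bi' based on the test outcome (PASS or FAIL):

b_i' = \begin{cases} P(B_i \mid P_x) & \text{if } T_x \text{ passes} \\ P(B_i \mid F_x) & \text{if } T_x \text{ fails} \end{cases}    (3)

These conditional probabilities are connected to test coverages through the
Bayesian relations:

P(B_i \mid P_x) = \frac{P(P_x \mid B_i) P(B_i)}{P(P_x)} = \frac{P(P_x \mid B_i)}{P(P_x)} b_i    (4)

P(B_i \mid F_x) = \frac{P(F_x \mid B_i) P(B_i)}{P(F_x)} = \frac{P(F_x \mid B_i)}{P(F_x)} b_i    (5)

And these probabilities are computed with the following:

P(P_x) = \prod_{i=1}^{n} \left[ 1 - a_{ix} b_i \right]    (6)

P(F_x) = 1 - \prod_{i=1}^{n} \left[ 1 - a_{ix} b_i \right]    (7)
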
Knowledge  of  the  coverage  dyad  {aix}  is  thus  intrinsic  to  minimizing  system  entropy 
(Equation  2)  and  developing  a  cost-effective  strategy  for  system  testing.  How  precisely 
must  these  coverages  be  specified  to  be  useful,  though? 
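
The update step can be sketched as follows. This is a minimal illustration under stated assumptions, not the decision support tool's implementation: module states are treated as independent, a test passes only if it misses every defect, and P(Px | Bi) is taken to be (1 - aix) times the product of (1 - ajx bj) over the other modules. The names `p_pass` and `update` are hypothetical; `a` holds the coverages of one test on each module, with 0 for modules the test does not touch.

```python
from math import prod


def p_pass(b, a):
    """Probability that the test passes, given module probabilities b and coverages a."""
    return prod(1.0 - a_i * b_i for a_i, b_i in zip(a, b))


def update(b, a, passed):
    """Return the posterior probabilities after the test passes or fails."""
    pp = p_pass(b, a)
    pf = 1.0 - pp
    posterior = []
    for a_i, b_i in zip(a, b):
        miss = 1.0 - a_i * b_i
        # ASSUMPTION: P(pass | module i bad) = (1 - a_i) * product over the other modules.
        pass_given_bad = (1.0 - a_i) * pp / miss if miss > 0.0 else 0.0
        if passed:
            likelihood, evidence = pass_given_bad, pp
        else:
            likelihood, evidence = 1.0 - pass_given_bad, pf
        # Bayes' rule; keep b_i unchanged if the observed outcome had zero probability.
        posterior.append(likelihood * b_i / evidence if evidence > 0.0 else b_i)
    return posterior


# Two modules at b = 0.5; the test covers 70% of the first and none of the second.
print(update([0.5, 0.5], [0.7, 0.0], passed=True))
print(update([0.5, 0.5], [0.7, 0.0], passed=False))
```

In this example a PASS lowers the first module's probability of being bad, a FAIL identifies it as the only possible culprit, and the uncovered module is unchanged in both cases.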


Simulation  Results 

In  previous  work  (Pfeiffer  et  al.,  2009a,  2009b),  we  have  investigated  the  relative 
performance  improvement  in  testing  strategies  using  a  best  next  test  (one-test  look  ahead) 
and  best  next  two  tests  (two-test  look  ahead).  The  coverages  for  these  investigations  were 
constructed  by  sampling  a  uniform  distribution  on  [0.1, 0.9]  for  each  connected  test  and 
module  pair. 

Results from Pfeiffer et al. (2010) suggest that the best-next-two-tests strategy offers
some improvement over a one-test look ahead, though the time required to develop the
two-test look ahead is on the order of 2.5 times that of the one-test strategy. A random test
selection  strategy  was  also  used  in  this  work  as  a  baseline  or  no-strategy  approach.  Both 
best  and  best-two  strategies  clearly  outperformed  this  random  approach. 

In this work, we also introduced an equivalent metric to residual information entropy
(Equation 1) using instead the maximum probability qi, related to bi by:

q_i = \max(b_i, 1 - b_i)    (8)


This measure is more intuitive than Equation 2, and represents an expected value of
a replacement (or maintenance) decision with respect to a particular module. If, for
example, a particular module has bi = 0.70, we may replace it knowing that this informed
guess should be correct 70% of the time. This also means that in 30% of these cases we
will unnecessarily replace or perform more granular debugging on this module. Our number
of correct diagnoses across the system will increase as each bi is adjusted, by testing, away
from bi = 0.5 towards either 0 or 1 (Figure 4). In Pfeiffer et al. (2009a), we have shown that
minimizing system entropy is approximately equivalent to maximizing the number of correct
diagnoses.

In evaluating the best next test (or best next two tests), the measure (Equation 8) is
aggregated as a system measure for a particular test Tx:

Q(T_x) = \sum_{i=1}^{n} q_i    (9)


At  any  point  in  diagnostic  testing,  all  available  Tx  are  evaluated  with  Equation  9  and 
the largest Q(Tx) indicates the next best Tx. The conditional probabilities dependent upon
the  specification  of  coverage  (Equations  3-7)  are  intrinsic  to  this  computation. 
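
A one-test look ahead of this kind can be sketched as below. The sketch rests on an assumption about how the aggregate is formed: Q(Tx) is taken to be the sum over modules of the expected value of qi after the test, weighting the PASS and FAIL posteriors by their outcome probabilities. It also assumes independent module states, as in the coverage product for a passing test. All function names and the example suite are hypothetical.

```python
from math import prod


def outcome_posteriors(b, a):
    """Return P(pass), P(fail) and the posterior vectors for each outcome of one test."""
    pp = prod(1.0 - a_i * b_i for a_i, b_i in zip(a, b))
    pf = 1.0 - pp
    on_pass, on_fail = [], []
    for a_i, b_i in zip(a, b):
        miss = 1.0 - a_i * b_i
        bp = (1.0 - a_i) * b_i / miss if miss > 0.0 else 0.0
        # Law of total probability: b_i = pp * bp + pf * bf.
        bf = (b_i - bp * pp) / pf if pf > 0.0 else b_i
        on_pass.append(bp)
        on_fail.append(min(1.0, max(0.0, bf)))  # guard against rounding error
    return pp, pf, on_pass, on_fail


def q(b_i):
    """Probability of a correct replace-or-keep decision for one module."""
    return max(b_i, 1.0 - b_i)


def expected_q(b, a):
    """ASSUMED form of the system measure: expected sum of q_i after running the test."""
    pp, pf, on_pass, on_fail = outcome_posteriors(b, a)
    return sum(pp * q(bp) + pf * q(bf) for bp, bf in zip(on_pass, on_fail))


def best_next_test(b, suite):
    """Pick the test name whose expected measure is largest."""
    return max(suite, key=lambda name: expected_q(b, suite[name]))


# Hypothetical suite: each test lists its coverage on three modules.
suite = {"T1": [0.9, 0.0, 0.0], "T2": [0.5, 0.0, 0.0], "T3": [0.0, 0.6, 0.6]}
b = [0.5, 0.5, 0.5]
print({name: round(expected_q(b, a), 3) for name, a in suite.items()})
print(best_next_test(b, suite))
```

A two-test look ahead would apply the same evaluation to pairs of tests, which is consistent with the higher computation time reported for that strategy.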

In  simulation  work  using  the  decision  support  tool  for  complex  testing,  we  examined 
the  sensitivity  of  test  strategies  to  the  specification  of  test  coverage  within  the  model. 
Specifically,  we  examined  both  random  and  best-next  test  strategies  with  different 
specifications of coverage about a mean coverage per module of 0.7 or 70%. All {bi} were
initialized with a maximum entropy value of bi = 0.5, consistent with our assertion that the
iterative simulation is relatively insensitive to the initial {bi} (Pfeiffer et al., 2009). All runs
were  made  with  zero  defects  present,  to  emphasize  the  utility  of  this  work  for  test-retest  or 
regression  scenarios. 

The  fundamental  connection  of  coverage  to  information  (Equations  2,  6,  and  7) 
suggests  that,  in  general,  more  coverage  per  test  should  yield  more  information.  For  this 
investigation,  four  specifications  were  used:  a  uniformly  distributed  coverage  among  tests 
and  modules  from  50%  to  90%,  or  [0.5, 0.9];  and  fixed  coverages  of  50%,  70%  and  90%.  A 
nominal  300  trials  were  used  for  this  work,  though  the  model  output  statistics  were 
examined  for  1000  trials  without  significant  difference. 
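
The experimental setup can be mimicked with a small Monte Carlo sketch. This is not the decision support tool: the system size (20 modules, 60 tests, 3 modules per test), the independence of repeated coverage on a module, and all names are assumptions made for illustration. With zero defects every test passes, so each executed test applies the PASS update b' = (1 - a) b / (1 - a b) to the modules it covers.

```python
import math
import random


def entropy(b):
    """Residual entropy in bits for one module."""
    if b <= 0.0 or b >= 1.0:
        return 0.0
    return -b * math.log2(b) - (1.0 - b) * math.log2(1.0 - b)


def make_suite(rng, n_modules, n_tests, modules_per_test, draw_coverage):
    """Build tests as {module index: coverage} maps with randomly chosen modules."""
    suite = []
    for _ in range(n_tests):
        covered = rng.sample(range(n_modules), modules_per_test)
        suite.append({i: draw_coverage(rng) for i in covered})
    return suite


def run_trial(rng, suite, n_modules):
    """Execute all tests in random order with no defects; return entropy after each test."""
    b = [0.5] * n_modules
    profile = [sum(entropy(x) for x in b)]
    for test in rng.sample(suite, len(suite)):
        for i, a in test.items():
            b[i] = (1.0 - a) * b[i] / (1.0 - a * b[i])  # PASS update
        profile.append(sum(entropy(x) for x in b))
    return profile


def mean_profile(draw_coverage, trials=300, seed=1,
                 n_modules=20, n_tests=60, modules_per_test=3):
    """Average the entropy profile over many trials for one coverage specification."""
    rng = random.Random(seed)
    total = [0.0] * (n_tests + 1)
    for _ in range(trials):
        suite = make_suite(rng, n_modules, n_tests, modules_per_test, draw_coverage)
        profile = run_trial(rng, suite, n_modules)
        total = [t + p for t, p in zip(total, profile)]
    return [t / trials for t in total]


# The four coverage specifications described in the text.
SPECS = {
    "uniform [0.5, 0.9]": lambda rng: rng.uniform(0.5, 0.9),
    "fixed 0.5": lambda rng: 0.5,
    "fixed 0.7": lambda rng: 0.7,
    "fixed 0.9": lambda rng: 0.9,
}

for label, draw in SPECS.items():
    profile = mean_profile(draw, trials=50)
    print(label, round(profile[20], 3), round(profile[-1], 3))
```

The printed values are the mean system entropy after 20 tests and after all tests; they are not intended to reproduce the figures that follow, only to show how the coverage specification enters a random test selection run.
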

[Figure 5 image: plot titled "Random Test Simulations".]

Figure 5. No-Strategy (Random) Testing Simulations

Note.  Simulation  results  using  different  coverage  specifications  for  a  random  test  selection  or 
a  no-strategy  approach  show,  in  general,  more  information  (smaller  H  or  larger  Q)  when 
coverage  per  test  increases. 

The  random  strategy  simulations  (Figure  5)  do  indeed  show  more  information 
returned  when  the  coverage  is  fixed  at  90%,  and  significantly  less  information  returned  when 
the  coverage  is  fixed  at  50%.  Perhaps  more  interesting  is  the  comparison  of  fixed  coverage 
at  70%  with  a  random  coverage  on  the  interval  [0.5,  0.9],  which  has  a  mean  of  70%.  These 
runs for the random strategy appear quite similar up to about the first 20 tests (Figure 5).
Beyond this point, the fixed coverage at 70% appears to outperform the random coverage on
[0.5, 0.9].

In  contrast  to  the  no-strategy  approach,  the  best  next  test  simulations  (Figure  6) 
show  pronounced  differences  among  coverage  specifications.  The  fixed  90%  coverage  run 
appears  somewhat  better  in  information  returned  per  test  execution,  though  interestingly  the 
70% and 50% coverage runs appear to underperform compared to the random test selection simulations
(Figure  5).  These  differences  are  not  consistent  over  the  test  execution  profile,  however. 
This  is  particularly  interesting  because  both  the  random  and  best  next  simulations  were  run 
with  random  number  generators  seeded  identically;  thus,  the  differences  highlighted 
between  Figures  5  and  6  are  solely  a  function  of  the  differences  in  the  rate  of  information 
returned  by  the  two  strategies. 

[Figure 6 image: plot titled "Best Next Test Simulations".]

Figure 6. Best-Next Testing Simulations

Note.  Simulation  results  using  different  coverage  specifications  for  a  best-next  test  selection 
show  marked  differences  among  runs.  While  it  is  still  true,  in  general,  that  more  coverage 
yields  more  information,  the  uniform  random  coverage  [0.5,  0.9]  and  low-coverage  (fixed  at 
0.5)  runs  appear  to  underperform  when  compared  to  the  random  test  selection  (Figure  5). 


All  of  these  simulations  were  conducted  with  no  defects  present,  and  identification  of 
a  defect  tends  to  sharply  alter  the  information  profile  in  a  run;  intuitively,  this  is  because  the 
first  FAIL  result  in  test  execution  should  sharply  reduce  the  number  of  suspect  modules 
across  the  system.  In  the  absence  of  defects,  it  is  possible,  particularly  as  the  testing 
progresses and alters the vector {bi}, that the differences among tests in information
returned  (Equation  9)  may  vary  widely  on  a  one-test  look  ahead. 

Consistent with our previous studies (e.g., Pfeiffer et al., 2010), we ran a two-test
look-ahead simulation to better assess the sensitivity of test selection strategy to
coverage specification. Overall results (Figure 7) show little improvement from the one-test
strategy  (Figure  6),  though  the  best  performer  (fixed  coverage  at  90%)  does  show  some 
early  improvement  over  the  first  ten  tests  executed.  These  results  do  suggest  that  the 
effectiveness  of  a  test  selection  strategy  is  connected  to  the  precision  with  which  the  test 
coverages  are  specified. 

Figure 7. Best-Next-Two Test Simulations

Note.  Simulation  results  using  different  coverage  specifications  for  a  best  next  two-test 
selection  also  show  marked  differences  among  runs,  similar  to  the  next-best-test  simulations 
(Figure  6).  Note  that  the  best  coverage  specification  (fixed  at  90%)  does  significantly 
improve  with  a  two-test  strategy,  though  the  other  fixed  runs  and  uniform  random  run  show 
little  improvement. 

We  should  keep  in  mind  that  these  idealized  simulations  place  no  constraint  on  the 
overlap  among  coverages  on  tests.  We  verified  in  simulation  log  files  that  significant  overlap 
among  tests  (e.g.,  Figure  3)  increased  as  the  fixed  coverage  progressed  from  50%  to  90%. 
The  impact  of  this  overlap  on  test  selection  appears  most  dramatic  in  the  non-random 
simulations  (Figures  6  and  7)  as  best  next  and  best-next-two  strategies  make  better  use  of 
the  information  returned  by  each  test.  This  overlap  also  means  that  many  or  most  of  the 
modules  in  the  idealized  system  were  completely  covered  by  some  number  of  tests  in  the 
test  suite,  leading  to  the  near  perfect  information  after  about  30  tests  have  been  executed 
(Figures  6  and  7).  We  expect  real-world  systems  would  rarely  achieve  perfect  coverage 
regardless  of  the  number  of  tests  available,  because  of  the  nature  of  complex  systems.  For 
anything  but  a  trivial  component  or  module,  we  are  unlikely  to  construct  a  set  of  tests  that 
cover  all  branch  paths  or  all  of  the  input  and  output  space. 

An  obvious  conclusion  from  these  results  is  that  more  coverage  per  test  appears  to 
improve  the  testing  process.  While  this  result  may  be  somewhat  intuitive,  an  equally  useful 
and  interesting  result  is  that  better  specification  of  coverages  increases  the  benefits  from  a 
rigorous  test  selection  strategy.  Further  investigation  with  both  real  coverage  data  from 
operational systems and idealized simulations with more complex distributions of coverage
should  yield  additional  insights  into  this  problem. 

Conclusions  and  Future  Work 

Effective, cost-efficient testing and re-testing are critical to the long-term success of
open  architecture.  Using  the  framework  for  complex  system  testing  developed  in  Pfeiffer  et 
al.  (2009),  we  have  conducted  additional  simulation  work  to  examine  the  sensitivity  of  test 
selection  strategies  to  the  specification  of  test  coverages.  Characterization  of  test 
coverages,  particularly  for  software-intensive  systems,  remains  a  difficult  challenge  (Zhu  et 
al.,  1997),  though  in  this  work  we  did  not  address  this  problem  directly.  Rather,  in  the 
framework  of  our  existing  model,  we  have  examined  the  impact  of  precision  in  specifying 
coverage  on  the  information  returned  per  test. 

Not  surprisingly,  the  test  selection  strategies  we  have  investigated  are  quite  sensitive 
to  various  specifications  of  test  coverage.  In  these  idealized  simulations,  less  precision  in 
coverage  specification  appears  to  flatten  the  information  returned  per  test.  Incorporation  of 
more  real  test  data  from  operational  systems  should  help  with  further  investigation  of  this 
point.  The  idealized  work  permitted  complete  (100%)  or  near-complete  coverage  of 
modules  with  overlapping  tests  (particularly  for  fixed  coverages  of  70%  to  90%)  that  would 
be  unlikely  in  real-world  testing.  In  fact,  we  speculate  that  in  simulating  real-world  systems, 
we  will  likely  encounter  test  scenarios  where  no  module  is  completely  covered  by  testing, 
and  real  coverages  are  at  best  95%  to  98%  with  all  overlapping  coverages  considered. 

The  decision  support  tool  used  in  these  simulations  could  also  be  further  refined  to 
permit  specifying  test-to-module  coverages  in  terms  of  a  collection  of  triangle  or  uniform 
distributions.  This  should  better  capture  subject  matter  expertise  in  a  quantitative  manner. 
For  example,  a  quasi-idealized  simulation  of  a  Garage  Door  Opening  System  could  work 
with  a  specification  that  the  Object  Detection  Test  exercised  at  least  30%  of  the  Remote 
Control  Module,  though  no  more  than  50%,  with  a  mode  or  mean  of  40%.  While  these 
numbers  may  still  be  speculative  or  notional  on  the  part  of  the  subject  matter  expert,  these 
confidence  bounds  would  represent  useful  input  to  the  overall  test  selection  strategy. 
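
The Garage Door example above could be expressed as a triangular distribution along the following lines. This is a minimal sketch using Python's standard `random.triangular`; the helper name `sample_coverage` and the seed are hypothetical, and the bounds are the notional 30%, 40%, and 50% figures from the example.

```python
import random


def sample_coverage(rng, low, mode, high):
    """Draw one test-to-module coverage from a triangular distribution."""
    # random.triangular takes its arguments in the order (low, high, mode).
    return rng.triangular(low, high, mode)


rng = random.Random(42)
# Object Detection Test on the Remote Control Module: at least 30%, at most 50%, mode 40%.
draws = [sample_coverage(rng, 0.30, 0.40, 0.50) for _ in range(10000)]
print(min(draws), max(draws), sum(draws) / len(draws))
```

Each simulation trial would then draw its coverages from the expert-supplied bounds rather than from a single fixed value or a system-wide uniform range.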